184 research outputs found

    frances: a deep learning NLP and text mining web tool to unlock historical digital collections : a case study on the Encyclopaedia Britannica

    Funding: This work was supported by the NLS Digital Fellowship and by the Google Cloud Platform research credit program.

    This work presents frances, an integrated text mining tool that combines information extraction, knowledge graphs, NLP, deep learning, parallel processing and Semantic Web techniques to unlock the full value of historical digital textual collections, offering researchers powerful analysis methods without distracting them with technology and middleware details. To demonstrate these capabilities, we use the first eight editions of the Encyclopaedia Britannica, offered by the National Library of Scotland (NLS), as an example digital collection to mine and analyse. We have developed novel parallel heuristics to extract terms (alongside metadata) from the original collection, which provides a mix of unstructured and semi-structured input data, and populated a new knowledge graph with this information. Our Natural Language Processing models enable frances to perform advanced analyses that go significantly beyond simple search using the information stored in the knowledge graph. Furthermore, frances also allows for creating and running complex text mining analyses at scale. Our results show that the novel computational techniques developed within frances provide a vehicle for researchers to formalize and connect findings and insights derived from the analysis of large-scale digital corpora such as the Encyclopaedia Britannica.
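    To make the knowledge-graph population step concrete, the following is a minimal sketch in Python using rdflib; the namespace, property names and sample record are hypothetical illustrations, not frances' actual schema.

        # Hypothetical sketch: storing one extracted encyclopaedia term,
        # with edition metadata, as RDF triples (not frances' real schema).
        from rdflib import Graph, Literal, Namespace, RDF, URIRef

        EB = Namespace("http://example.org/eb/")  # placeholder namespace

        g = Graph()
        g.bind("eb", EB)

        # One extracted term with its metadata (edition, volume, definition).
        term = URIRef(EB["ABACUS_ed1"])
        g.add((term, RDF.type, EB.Term))
        g.add((term, EB.name, Literal("ABACUS")))
        g.add((term, EB.edition, Literal(1)))
        g.add((term, EB.volume, Literal(1)))
        g.add((term, EB.definition, Literal("An instrument for computation ...")))

        print(g.serialize(format="turtle"))

    Queries over such a graph (e.g. via SPARQL) are what allow analyses to go beyond plain keyword search.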

    Building lightweight semantic search engines

    Despite significant advances in methods for processing large volumes of structured and unstructured data, surprisingly little attention has been devoted to developing general practical methodologies that leverage state-of-the-art technologies to build domain-specific semantic search engines tailored to use cases where they could provide substantial benefits. This paper presents a methodology for developing these kinds of systems in a lightweight, modular, and flexible way, with a particular focus on providing powerful search tools in domains where non-expert users encounter challenges in exploring the data repository at hand. Using an academic expertise finder tool as a case study, we demonstrate how this methodology allows us to leverage powerful off-the-shelf technology to enable the rapid, low-cost development of semantic search engines, while also affording developers the necessary flexibility to embed user-centric design in their development in order to maximise uptake and application value.
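    As a rough illustration of the lightweight pattern the methodology describes, here is a minimal embedding-based search core in Python; the sentence-transformers model name and the toy expertise corpus are assumptions for this example, not the components used in the paper.

        # Minimal semantic search sketch: embed a corpus, embed a query,
        # rank documents by cosine similarity. Model and data are illustrative.
        from sentence_transformers import SentenceTransformer, util

        model = SentenceTransformer("all-MiniLM-L6-v2")

        # Tiny stand-in for a domain-specific repository of expertise profiles.
        corpus = [
            "Expert in machine learning applied to medical imaging.",
            "Research on parallel I/O for high-performance computing clusters.",
            "Text mining of historical newspaper archives.",
        ]
        corpus_emb = model.encode(corpus, convert_to_tensor=True)

        query = "Who works on HPC storage systems?"
        query_emb = model.encode(query, convert_to_tensor=True)

        scores = util.cos_sim(query_emb, corpus_emb)[0]
        for doc, score in sorted(zip(corpus, scores), key=lambda p: -float(p[1])):
            print(f"{float(score):.3f}  {doc}")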

    RepoGraph : a novel semantic code exploration tool for Python repositories based on knowledge graphs and deep learning

    This work presents RepoGraph, an integrated semantic code exploration web tool that combines information extraction, knowledge graphs, and deep learning models. It offers new capabilities for software developers (from academia and industry) to represent and query Python repositories. Unlike existing tools, RepoGraph not only provides a novel search interface powered by deep learning techniques but also exposes the underlying features and representations of repositories to users. Additionally, it offers several interactive visualizations. We also introduce RepoPyOnto, a new ontology that captures the features of Python code repositories and is used by RepoGraph for representing the captured knowledge. Finally, we successfully evaluate RepoGraph against several criteria, including function summarization performance, the correctness and relevance of search results, as well as the processing time for constructing graphs of various sizes.
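    To illustrate how such exposed repository representations might be queried, here is a hedged sketch using SPARQL over an exported graph; the rpo: vocabulary and file name below are placeholders, not RepoPyOnto's actual terms.

        # Hypothetical query: list functions and their generated summaries
        # from a repository knowledge graph serialized as Turtle.
        from rdflib import Graph

        g = Graph()
        g.parse("repo_graph.ttl", format="turtle")  # placeholder export file

        query = """
        PREFIX rpo: <http://example.org/repopyonto#>
        SELECT ?func ?summary WHERE {
            ?func a rpo:Function ;
                  rpo:hasSummary ?summary .
        }
        """
        for func, summary in g.query(query):
            print(func, "->", summary)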

    Mapping the repository landscape : harnessing similarity with RepoSim and RepoSnipy

    The rapid growth of scientific software development has led to the emergence of large and complex codebases, making it challenging to search, find, and compare software repositories within the scientific research community. In this paper, we propose a solution by leveraging deep learning techniques to learn embeddings that capture semantic similarities among repositories. Our approach focuses on identifying repositories with similar semantics, even when their code fragments and documentation exhibit different syntax. To address this challenge, we introduce two complementary open-source tools: RepoSim and RepoSnipy. RepoSim is a command-line toolbox designed to represent repositories at both the source code and documentation levels. It utilizes the UniXcoder pre-trained language model, which has demonstrated remarkable performance in code-related understanding tasks. RepoSnipy is a web-based neural semantic search engine that builds on the capabilities of RepoSim and offers a user-friendly search interface, allowing researchers and practitioners to query public repositories hosted on GitHub and discover semantically similar repositories. RepoSim and RepoSnipy empower researchers, developers, and practitioners by facilitating the comparison and analysis of software repositories. They not only enable efficient collaboration and code reuse but also accelerate the development of scientific software.
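    The following sketch shows how code-level embeddings can be obtained from the UniXcoder model via the generic Hugging Face API; the mean-pooling and cosine-similarity choices here are illustrative assumptions, not necessarily RepoSim's exact pipeline.

        # Embed two snippets with UniXcoder and compare them; semantically
        # similar code should yield a high cosine similarity.
        import torch
        from transformers import AutoModel, AutoTokenizer

        tok = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
        model = AutoModel.from_pretrained("microsoft/unixcoder-base")

        def embed(code: str) -> torch.Tensor:
            """Mean-pool the last hidden states into a single code vector."""
            inputs = tok(code, return_tensors="pt", truncation=True, max_length=512)
            with torch.no_grad():
                hidden = model(**inputs).last_hidden_state  # (1, seq, dim)
            return hidden.mean(dim=1).squeeze(0)

        a = embed("def add(x, y):\n    return x + y")
        b = embed("def sum_two(a, b):\n    return a + b")
        print(torch.cosine_similarity(a, b, dim=0).item())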

    inspect4py : a knowledge extraction framework for Python code repositories

    This work presents inspect4py, a static code analysis framework designed to automatically extract the main features, metadata and documentation of Python code repositories. Given an input folder with code, inspect4py uses abstract syntax trees and state-of-the-art tools to find all functions, classes, tests, documentation, call graphs, module dependencies and control flows within all code files in that repository. Using these findings, inspect4py infers different ways of invoking a software component. We have evaluated our framework on 95 annotated repositories, obtaining promising results for software type classification (over 95% F1-score). With inspect4py, we aim to ease the understandability and adoption of software repositories by other researchers and developers.
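    A simplified illustration of the kind of AST-based extraction inspect4py automates, using only the Python standard library (this is not inspect4py's own implementation):

        # Walk a module's AST and report its classes, functions and docstrings.
        import ast

        source = '''
        class Greeter:
            """Says hello."""
            def greet(self, name):
                return f"Hello, {name}"

        def main():
            print(Greeter().greet("world"))
        '''

        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
                kind = "class" if isinstance(node, ast.ClassDef) else "function"
                print(kind, node.name, "| docstring:", ast.get_docstring(node))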

    Adaptive Optimizations for Stream-based Workflows


    Laminar: A New Serverless Stream-based Framework with Semantic Code Search and Code Completion

    This paper introduces Laminar, a novel serverless framework based on dispel4py, a parallel stream-based dataflow library. Laminar efficiently manages streaming workflows and components through a dedicated registry, offering a seamless serverless experience. Leveraging large language models, Laminar enhances the framework with semantic code search, code summarization, and code completion. This contribution enhances serverless computing by simplifying the execution of streaming computations, managing data streams more efficiently, and offering a valuable tool for both researchers and practitioners.

    Comment: 13 pages, 10 figures, 6 tables
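    To make the stream-based dataflow model concrete, here is a toy producer/transformer/consumer pipeline in plain Python; it only illustrates the dataflow style that dispel4py generalizes and does not use dispel4py's actual API (PEs, workflow graphs, parallel mappings).

        # Toy stream pipeline: items flow through stages as they are produced.
        from typing import Iterable, Iterator

        def produce(n: int) -> Iterator[int]:
            """Emit a stream of integers."""
            yield from range(n)

        def square(stream: Iterable[int]) -> Iterator[int]:
            """Transform each item as it flows past."""
            for item in stream:
                yield item * item

        def consume(stream: Iterable[int]) -> None:
            """Terminal stage: act on each result."""
            for item in stream:
                print("result:", item)

        consume(square(produce(5)))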

    GeoSocial: exploring the usefulness of social media mining in the applied natural geohazard sciences

    Obtaining real-time information about a geohazard event as it unfolds, such as a flood or earthquake, used to be largely limited to the professional media. Nowadays, obtaining news stories from social media (e.g. Facebook, Twitter, YouTube, Flickr, etc.) directly as they unfold is becoming the ‘norm’ for many in society. The Haitian Earthquake in January 2010 and the Great East Japan Earthquake in March 2011 provided some of the first natural hazard examples to really demonstrate the power of social media over traditional news sources for obtaining live information from which people and authorities could gain situational awareness.

    Dynamic optimization techniques for MPI-based parallel applications

    Nowadays, applications in high-performance computing environments, such as scientific simulations or data-mining codes, require not only enormous compute and memory resources but also the handling of huge volumes of data. Parallel computation on cluster architectures has become the most common solution for developing such high-performance scientific applications. Message Passing Interface (MPI) [Mes94] is the message-passing library most widely used to provide communications in clusters. MPI provides a standard interface for operations such as point-to-point communication, collective communication, synchronization, and I/O operations. During the I/O phase, processes frequently access a common data set by issuing a large number of small non-contiguous I/O requests [NKP+96a, SR98], which can create bottlenecks in the I/O subsystem. These bottlenecks are even more severe in commodity clusters, where commercial networks are usually installed. Many of those networks, such as Fast Ethernet or Gigabit, have high latency and low bandwidth, which introduces performance penalties during program execution. Scalability is also an important issue in cluster systems when many processors are used, since this may cause network saturation and even higher latencies. As communication-intensive parallel applications spend a significant amount of their total execution time exchanging data between processes, these problems may lead to poor performance not only in the I/O subsystem but also in the communication phase. It is therefore necessary to develop techniques that improve the performance of both the communication and I/O subsystems. The main goal of this Ph.D. thesis is to improve the scalability and performance of MPI-based applications executed on clusters by reducing the overhead of the I/O and communication subsystems. In summary, this work proposes two techniques that solve these problems efficiently while managing the high complexity of a heterogeneous environment:
    • Reduction of the number of communications in collective I/O operations: Many applications use collective I/O operations to read/write data from/to disk; one of the most widely used is the Two-Phase I/O technique, extended by Thakur and Choudhary in ROMIO. In this technique, many communications among the processes are performed, which can create a bottleneck. This bottleneck is even higher in commodity clusters with commercial networks, and in CMP clusters, where the I/O bus is shared by the cores of a single node. We therefore propose improving locality in order to reduce the number of communications performed in Two-Phase I/O.
    • Reduction of the transferred data volume: This thesis attempts to reduce the cost of exchanged messages by applying lossless compression to the data exchanged between processes. Furthermore, we propose turning compression on and off and selecting the most appropriate compression algorithm at run time, depending on the characteristics of each message, network performance, and the behavior of the compression algorithms; a sketch of this idea follows the list.
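    A minimal sketch of the adaptive compression idea, assuming mpi4py and zlib; the compress-only-when-smaller heuristic is a simplification of the run-time algorithm selection the thesis proposes (run with e.g. mpiexec -n 2).

        # Send compressed data between MPI ranks only when compression pays off.
        import zlib
        from mpi4py import MPI

        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()

        def send_adaptive(payload: bytes, dest: int) -> None:
            """Tag the message with the codec actually used."""
            packed = zlib.compress(payload)
            if len(packed) < len(payload):
                comm.send(("zlib", packed), dest=dest)
            else:
                comm.send(("raw", payload), dest=dest)

        def recv_adaptive(source: int) -> bytes:
            codec, data = comm.recv(source=source)
            return zlib.decompress(data) if codec == "zlib" else data

        if rank == 0:
            send_adaptive(b"A" * 1_000_000, dest=1)   # highly compressible
        elif rank == 1:
            print("received", len(recv_adaptive(0)), "bytes")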

    Yagan Heritage in Tierra del Fuego (Argentina): The Politics of Balance

    This paper analyses the tangible and intangible Yagan heritage contents exhibited by the Museo del Fin del Mundo (MFM, Ushuaia, Tierra del Fuego, Argentina) and presented during its guided tour led by Yagan Community Counsellor Victor Vargas Filgueira. We show how the critical outlook of Fuegian history offered in the latter challenges the traditional past-only fossilized view of the Yagan, building past–present links and helping to overcome biased hegemonic discourses. We also discuss how employing a member of the Yagan Community at the MFM has been an efficient and low-budget strategy that helps to comply with some of the Goals of the UN 2030 Agenda for Sustainable Development, which are difficult to attain in developing countries. Significant outcomes of this process include: (a) providing a full-time formal job to a member of an Indigenous Community who has been traditionally dispossessed of/in their own territory; (b) acknowledging him as a knowledge holder and valuable member of society; (c) moving the role of Yagan People from subject to agent of the MFM. This process has fostered the dialogue between Yagan voices and academic discourses, challenging traditional Western dichotomies (ecology/economy, natural/cultural heritage, and so forth) and contributing to the discussion of key concepts on sustainability and engagement.

    Fil: Fiore, Danae. Universidad de Buenos Aires. Facultad de Filosofía y Letras; Argentina. Asociación de Investigaciones Antropológicas; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina
    Fil: Butto, Ana Rosa. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Austral de Investigaciones Científicas; Argentina
    Fil: Vargas Filgueira, Victor. Museo del Fin del Mundo; Argentina